{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Data Class"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import random\n",
    "\n",
    "class Sentiment:\n",
    "    NEGATIVE = \"NEGATIVE\"\n",
    "    NEUTRAL = \"NEUTRAL\"\n",
    "    POSITIVE = \"POSITIVE\"\n",
    "\n",
    "class Review:\n",
    "    def __init__(self, text, score):\n",
    "        self.text = text\n",
    "        self.score = score\n",
    "        self.sentiment = self.get_sentiment()\n",
    "        \n",
    "    def get_sentiment(self):\n",
    "        if self.score <= 2:\n",
    "            return Sentiment.NEGATIVE\n",
    "        elif self.score == 3:\n",
    "            return Sentiment.NEUTRAL\n",
    "        else: #Score of 4 or 5\n",
    "            return Sentiment.POSITIVE\n",
    "\n",
    "class ReviewContainer:\n",
    "    def __init__(self, reviews):\n",
    "        self.reviews = reviews\n",
    "        \n",
    "    def get_text(self):\n",
    "        return [x.text for x in self.reviews]\n",
    "    \n",
    "    def get_sentiment(self):\n",
    "        return [x.sentiment for x in self.reviews]\n",
    "        \n",
    "    def evenly_distribute(self):\n",
    "        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))\n",
    "        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))\n",
    "        positive_shrunk = positive[:len(negative)]\n",
    "        self.reviews = negative + positive_shrunk\n",
    "        random.shuffle(self.reviews)\n",
    "        \n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'I hoped for Mia to have some peace in this book, but her story is so real and raw.  Broken World was so touching and emotional because you go from Mia\\'s trauma to her trying to cope.  I love the way the story displays how there is no \"just bouncing back\" from being sexually assaulted.  Mia showed us how those demons come for you every day and how sometimes they best you. I was so in the moment with Broken World and hurt with Mia because she was surrounded by people but so alone and I understood her feelings.  I found myself wishing I could give her some of my courage and strength or even just to be there for her.  Thank you Lizzy for putting a great character\\'s voice on a strong subject and making it so that other peoples story may be heard through Mia\\'s.'"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "file_name = './data/sentiment/books_small_10000.json'\n",
    "\n",
    "reviews = []\n",
    "with open(file_name) as f:\n",
    "    for line in f:\n",
    "        review = json.loads(line)\n",
    "        reviews.append(Review(review['reviewText'], review['overall']))\n",
    "        \n",
    "reviews[5].text\n",
    "        "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Prep Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "training, test = train_test_split(reviews, test_size=0.33, random_state=42)\n",
    "\n",
    "train_container = ReviewContainer(training)\n",
    "\n",
    "test_container = ReviewContainer(test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "436\n",
      "436\n"
     ]
    }
   ],
   "source": [
    "train_container.evenly_distribute()\n",
    "train_x = train_container.get_text()\n",
    "train_y = train_container.get_sentiment()\n",
    "\n",
    "test_container.evenly_distribute()\n",
    "test_x = test_container.get_text()\n",
    "test_y = test_container.get_sentiment()\n",
    "\n",
    "print(train_y.count(Sentiment.POSITIVE))\n",
    "print(train_y.count(Sentiment.NEGATIVE))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Bag of words vectorization"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I read this book over a year ago & enjoyed the various stories, the author takes you on a journey of life as it pretty much is in today's world & society, as you end one story you look forward to starting the next, relaxed reading I highly recommend it for peps who enjoy stories from back in their grand-ma & grand-dad days in the South.  I will peruse more books by this author for future purchase.\n",
      "[[0. 0. 0. ... 0. 0. 0.]]\n"
     ]
    }
   ],
   "source": [
    "from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer\n",
    "\n",
    "# This book is great !\n",
    "# This book was so bad\n",
    "\n",
    "vectorizer = TfidfVectorizer()\n",
    "train_x_vectors = vectorizer.fit_transform(train_x)\n",
    "\n",
    "test_x_vectors = vectorizer.transform(test_x)\n",
    "\n",
    "print(train_x[0])\n",
    "print(train_x_vectors[0].toarray())\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Linear SVM\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['POSITIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import svm\n",
    "\n",
    "clf_svm = svm.SVC(kernel='linear')\n",
    "\n",
    "clf_svm.fit(train_x_vectors, train_y)\n",
    "\n",
    "test_x[0]\n",
    "\n",
    "clf_svm.predict(test_x_vectors[0])\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Decision Tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['NEGATIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 51,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "clf_dec = DecisionTreeClassifier()\n",
    "clf_dec.fit(train_x_vectors, train_y)\n",
    "\n",
    "clf_dec.predict(test_x_vectors[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Naive Bayes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['NEGATIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.naive_bayes import GaussianNB\n",
    "\n",
    "clf_gnb = DecisionTreeClassifier()\n",
    "clf_gnb.fit(train_x_vectors, train_y)\n",
    "\n",
    "clf_gnb.predict(test_x_vectors[0])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Logistic Regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\linear_model\\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.\n",
      "  FutureWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array(['POSITIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "clf_log = LogisticRegression()\n",
    "clf_log.fit(train_x_vectors, train_y)\n",
    "\n",
    "clf_log.predict(test_x_vectors[0])\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8076923076923077\n",
      "0.65625\n",
      "0.6706730769230769\n",
      "0.8028846153846154\n"
     ]
    }
   ],
   "source": [
    "# Mean Accuracy\n",
    "print(clf_svm.score(test_x_vectors, test_y))\n",
    "print(clf_dec.score(test_x_vectors, test_y))\n",
    "print(clf_gnb.score(test_x_vectors, test_y))\n",
    "print(clf_log.score(test_x_vectors, test_y))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([0.80582524, 0.80952381])"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# F1 Scores\n",
    "from sklearn.metrics import f1_score\n",
    "\n",
    "f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])\n",
    "#f1_score(test_y, clf_log.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array(['POSITIVE', 'NEGATIVE', 'NEGATIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 56,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test_set = ['very fun', \"bad book do not buy\", 'horrible waste of time']\n",
    "new_test = vectorizer.transform(test_set)\n",
    "\n",
    "clf_svm.predict(new_test)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tuning our model (with Grid Search)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\lib\\site-packages\\sklearn\\svm\\base.py:193: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.\n",
      "  \"avoid this warning.\", FutureWarning)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "GridSearchCV(cv=5, error_score='raise-deprecating',\n",
       "             estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,\n",
       "                           decision_function_shape='ovr', degree=3,\n",
       "                           gamma='auto_deprecated', kernel='rbf', max_iter=-1,\n",
       "                           probability=False, random_state=None, shrinking=True,\n",
       "                           tol=0.001, verbose=False),\n",
       "             iid='warn', n_jobs=None,\n",
       "             param_grid={'C': (1, 4, 8, 16, 32), 'kernel': ('linear', 'rbf')},\n",
       "             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,\n",
       "             scoring=None, verbose=0)"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "parameters = {'kernel': ('linear', 'rbf'), 'C': (1,4,8,16,32)}\n",
    "\n",
    "svc = svm.SVC()\n",
    "clf = GridSearchCV(svc, parameters, cv=5)\n",
    "clf.fit(train_x_vectors, train_y)\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.8076923076923077\n"
     ]
    }
   ],
   "source": [
    "print(clf.score(test_x_vectors, test_y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving Model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Save model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pickle\n",
    "\n",
    "with open('./models/sentiment_classifier.pkl', 'wb') as f:\n",
    "    pickle.dump(clf, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Load model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./models/entiment_classifier.pkl', 'rb') as f:\n",
    "    loaded_clf = pickle.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "I loved this book and the previous books in this series. It brings out every emotion you can think of. I look forward to reading more books by this author.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "array(['POSITIVE'], dtype='<U8')"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(test_x[0])\n",
    "\n",
    "loaded_clf.predict(test_x_vectors[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import Perceptron\n",
    "\n",
    "clf = Perceptron(tol=1e-3, random_state=0)\n",
    "clf.fit(X, y)  "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
